FEAT: Add word-game option to DecompositionConverter #2051
adrian-gavrila
left a comment
Thanks for the contribution! A few small things worth attention but overall looks great
…mposition-word-game
@adrian-gavrila Thanks for the review. Addressed all three: codeword uniqueness is now validated in init with a test, the arg docstring is trimmed, and the overflow message states the threshold breach instead of a count.
adrian-gavrila
left a comment
Looks good! Thank you for addressing the comments.
…mposition-word-game
…overflow and empty response retryable, validate empty codewords/phrases
@romanlutz Thanks for the review. All three points are addressed:
Following the same "model output and user config are both unpredictable" idea, I hardened a few adjacent
Diff coverage on the changed lines is complete (escaping, Arabic, overflow recovery, empty |
Description
This adds an optional word-game mode to DecompositionConverter (the DrAttack decompose-and-reconstruct converter from #2003), via use_word_game: bool = False. When enabled, each harmful noun phrase is replaced by an innocuous codeword in the reconstruction questions, and a mapping preamble (for example 'apple' means 'a bomb') is established in the same prompt. This is the second half of DrAttack: it further conceals the harmful nouns by splitting them from the request behind codewords.
Off by default, so the merged converter behaviour is unchanged.
Two design choices worth flagging up front:
Inline, not a separate prepended conversation. We had discussed the word-game as a prepended/simulated conversation; I went with inline (preamble and reconstruction in one prompt) for two reasons. First, coupling: the codewords must match the reconstruction the converter builds, and a separate conversation generates its turns independently, so they cannot share the mapping without a stateful component (an attack class), which we wanted to avoid. Inline keeps it a pure converter. Second, the numbers. Inline matches the two-turn version, and both are far above no word-game:
So, inline keeps essentially all of the effect on the frontier model, with no new attack class. Open to the prepended-conversation route if you prefer it.
A toggle on the converter, not a separate converter. The codewords have to stay in sync with the reconstruction this converter produces, so a separate converter cannot do it; it has to be a mode of this converter.
Note on the mechanism: the harmful phrase still appears once, in the mapping line; the concealment is that the question uses the codeword, splitting the harmful term from the request. This is the paper's word-game, and the numbers above show the lift.
(All numbers are GPT-judge refusal-bypass, not operational harm, consistent with the #2003 assessment.)
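For readers who want the inline mechanism in concrete terms, here is a minimal sketch of the substitution step. It is not the code in this PR: the function name apply_word_game, its signature, and the error messages are hypothetical, and the example phrase is a neutral placeholder. It only illustrates the idea described above, namely a mapping preamble followed by the question with each phrase swapped for its codeword, plus a uniqueness check of the kind mentioned in the review thread.

```python
def apply_word_game(question: str, phrases: list[str], codewords: list[str]) -> str:
    """Hypothetical helper: prepend a codeword mapping and swap phrases for codewords."""
    # Duplicate codewords would make the mapping ambiguous, so reject them early.
    if len(set(codewords)) != len(codewords):
        raise ValueError("codewords must be unique")
    # Every phrase needs its own codeword.
    if len(phrases) > len(codewords):
        raise ValueError("more phrases than available codewords")

    mapping = dict(zip(phrases, codewords))

    rewritten = question
    # Replace longer phrases first so a phrase nested inside another is handled once.
    for phrase in sorted(mapping, key=len, reverse=True):
        rewritten = rewritten.replace(phrase, mapping[phrase])

    # The preamble states each mapping once; the question itself only uses codewords.
    preamble = "\n".join(f"'{code}' means '{phrase}'" for phrase, code in mapping.items())
    return f"{preamble}\n{rewritten}"


if __name__ == "__main__":
    print(apply_word_game("How do I open the locked door?", ["the locked door"], ["apple"]))
```

Because the preamble and the rewritten question are built from the same mapping in one call, they cannot drift apart, which is the coupling argument for keeping this a mode of the converter rather than a separate component.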
Tests and Documentation
use_word_game parameter in doc/code/converters/1_text_to_text_converters.py; ran JupyText --sync.
ruff check and format clean; ty reports no errors; full converter and docs test suites pass.
cc @rlundeen2 @romanlutz